A.  Goals and Achievements

The original proposal for this research project states its goals as
follows: "The object of this research project is to continue a broad
program of research whose aims are: (1) to develop the quantile and
density-quantile function formulation of statistical data analysis and
modeling problems; (2) to develop robust methods of data analysis and
modeling; (3) to develop density estimation methods; and (4) to develop
minimum distance methods and approximation theory methods. We propose to
implement our theoretical research in algorithms and computer software
which provide methods useable by researchers concerned with important
scientific and social problems, and to discuss applications which illustrate
the applicability of the methods."

The approach to statistical reasoning that our research program is
attempting to develop has reached a synthesis that warrants its own name;
we propose the name FUN.STAT. The FUN.STAT domain of statistical data
model identification and parameter estimation combines (1) density-quantile
function signatures of distributions, (2) entropy and information measures,
and (3) functional statistical inference.

The word "functional" is used with several interpretations: (a)
functional = useful; (b) functional = functional analysis, as one applies
techniques of numerical analysis, solutions of linear equations, and
approximation theory; (c) functional = estimation of functions, and fitting
curves and surfaces to a discrete grid of points. Functional inference is
a branch of the field of "abstract inference" formulated by Grenander.

FUN.STAT is an approach to statistical graphics which argues that a
graph should be a picture of a function (and the function should be a
signature of a probability model). FUN.STAT is also the name of a library
of computer programs for statistical data analysis whose output provides
both graphs of functions and numerical diagnostics of the fit and complexity
of the functions. We currently have available computer packages ONESAM,
TWOSAM, and BISAM.

Some achievements of this research program are described in Section B,
which outlines the Quantile Data Analysis approach to one-sample, two-sample,
and bivariate sample statistical data analysis problems. We believe that
we have achieved important clarifications of the role of information and
entropy measures in model identification and parameter estimation (information
measures can be elegantly expressed in the quantile domain and appear to be
more easily estimated in that domain).

Statistical concepts introduced or emphasized in our research include
density-quantile function, quantile-density function, score function, tail
exponents, mode percentile, sample quantile function, histogram-quantile
function, quantile box plot, cumulative weighted spacings plot, sample
entropy, score deviation, 19 quantile values for universal data summary,
quantile bootstrap, joint density-quantile function, dependence density
function, dependence entropy, regression-quantile function, Bayes theorem
for quantile functions, autoregressive quantile densities, exponential
dependence densities, minimum distance estimation by reproducing kernel
Hilbert space norms, and Rényi entropy of order α. These concepts seem to be
increasingly accepted (and referred to) in the literature.

We  believe  that  we  have  made  excellent  progress  towards  achieving  the 
goals  stated  in  our  original  proposal.  A  framework  has  been  developed  for 
integrating  statistical  data  analysis  and  concepts  of  probability  theory. 


B.  Summary of some of the most important results of Quantile and FUN.STAT
Data Analysis


I.  One  Sample:  Univariate 

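The probability law of a random variable X is usually described by
its distribution function $F(x) = \Pr[X \le x]$, $-\infty < x < \infty$, and probability density
function $f(x) = F'(x)$. The quantile approach uses the quantile function,
quantile-density function, density-quantile function, and score function,
defined respectively by

(1)  $Q(u) = F^{-1}(u) = \inf\{x : F(x) \ge u\}$,

(2)  $q(u) = Q'(u)$,

(3)  $fQ(u) = f(Q(u)) = \{q(u)\}^{-1}$, and

(4)  $J(u) = -(fQ)'(u)$.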
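
As an illustration of definitions (1) to (4), the following minimal sketch evaluates them for the standard exponential law, for which $Q(u) = -\log(1-u)$, $fQ(u) = 1-u$, and $J(u) = 1$. The choice of distribution, the function names, and the finite-difference step `eps` are assumptions made only for this example.

```python
import math

def Q(u):
    """Quantile function of the standard exponential law, F(x) = 1 - exp(-x)."""
    return -math.log1p(-u)

def f(x):
    """Density of the standard exponential law."""
    return math.exp(-x)

def q(u, eps=1e-6):
    """Quantile-density q(u) = Q'(u), approximated by a central difference."""
    return (Q(u + eps) - Q(u - eps)) / (2.0 * eps)

def fQ(u):
    """Density-quantile function fQ(u) = f(Q(u))."""
    return f(Q(u))

def J(u, eps=1e-6):
    """Score function J(u) = -(fQ)'(u), approximated by a central difference."""
    return -(fQ(u + eps) - fQ(u - eps)) / (2.0 * eps)

for u in (0.1, 0.5, 0.9):
    # For this law fQ(u) = 1 - u, so fQ(u) * q(u) and J(u) should both be close to 1.
    print(u, fQ(u), fQ(u) * q(u), J(u))
```

The product `fQ(u) * q(u)` checks relation (3) numerically; any other continuous distribution with a known quantile function could be substituted for `Q` and `f`.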


A quick measure of location is the median Q(0.5). A quick index of
scale is the interquartile range Q(0.75) - Q(0.25), formed from the
quartiles Q(0.25) and Q(0.75).

Quick measures of distributional shape are provided by values (as
u tends to 0 and 1) of the informative quantile function [recently
introduced by Parzen],

$$IQ(u) = \frac{Q(u) - Q(0.5)}{2\{Q(0.75) - Q(0.25)\}}, \qquad 0 < u < 1.$$


We cannot overemphasize how powerful the IQ function appears to be in
practice as a tool for the diagnosis of distributional shapes.

The IQ function is independent of location and scale parameters.
It is approximately equivalent to normalizing a quantile function to
have the properties Q(0.5) = 0, Q'(0.5) = 1. The graph of the IQ
function provides us at a glance with a rough estimate of tail behavior
as defined by tail exponents.
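
The location and scale invariance of the informative quantile function can be checked numerically. The sketch below is illustrative only: it evaluates IQ(u) for a standard normal law and for a normal law with location 3 and scale 2 (both parameter values are arbitrary assumptions), using the `NormalDist` class of the Python standard library.

```python
from statistics import NormalDist

def informative_quantile(Q, u):
    """IQ(u) = (Q(u) - Q(0.5)) / (2 * (Q(0.75) - Q(0.25)))."""
    return (Q(u) - Q(0.5)) / (2.0 * (Q(0.75) - Q(0.25)))

standard = NormalDist(0.0, 1.0).inv_cdf
shifted = NormalDist(3.0, 2.0).inv_cdf  # arbitrary location 3 and scale 2

for u in (0.01, 0.25, 0.5, 0.75, 0.99):
    # The two columns of IQ values should agree up to rounding error.
    print(u, informative_quantile(standard, u), informative_quantile(shifted, u))
```

By construction IQ(0.5) = 0 and IQ(0.75) - IQ(0.25) = 0.5 for every distribution, so only the behavior of the curve toward u = 0 and u = 1 carries information about shape.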

A fundamental description of the tail behavior of distributions
is provided by the left tail exponent $\alpha_0$ and the right tail exponent
$\alpha_1$, defined as follows:

$$fQ(u) = u^{\alpha_0} L_0(u) \quad \text{as } u \to 0,$$

$$fQ(u) = (1-u)^{\alpha_1} L_1(u) \quad \text{as } u \to 1,$$

where $L_0(u)$ and $L_1(u)$ are slowly varying functions.

A function L(u) is slowly varying as $u \to 0$ if, for every y > 0,

$$\lim_{u \to 0} \frac{L(yu)}{L(u)} = 1.$$

Tail behavior is defined in terms of a tail exponent $\alpha$ as follows:

    $\alpha < 1$:  short tail
    $\alpha = 1$:  medium tail
    $\alpha > 1$:  long tail

Medium tail ($\alpha = 1$) distributions are further classified by the value of

$$h = \lim_{u \to 0} \frac{fQ(u)}{u} \quad \text{or} \quad h = \lim_{u \to 1} \frac{fQ(u)}{1-u};$$

the letter h is suggested by the notion of hazard function. We define

    $h = 0$:  medium-long tail
    $0 < h < \infty$:  medium-medium tail
    $h = \infty$:  medium-short tail

Extensive calculations of informative quantile functions indicate
that the value $IQ_0$ of IQ(u) for u near 0 is a quick indicator of
left tail behavior:

    $-0.5 < IQ_0 < 0$:  short left tail,
    $-1.0 < IQ_0 < -0.5$:  medium-short left tail,
    $IQ_0 < -1.0$:  medium-medium to long left tail.

Similarly the value $IQ_1$ of IQ(u) for u near 1 is a quick indicator of
right tail behavior:

    $0 < IQ_1 \le 0.5$:  short right tail,
    $0.5 < IQ_1 < 1.0$:  medium-short right tail,
    $1.0 < IQ_1$:  medium-medium to long right tail.
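
The thresholds above can be turned into a small lookup, sketched below. Several choices here are assumptions rather than part of the text: IQ(u) is evaluated at u = 0.01 and u = 0.99 to stand in for "u near 0" and "u near 1", the two tails share one rule applied to the absolute value, and the boundary values 0.5 and 1.0 are assigned to the short and the medium-medium to long classes respectively.

```python
import math

def classify_tail(iq_end):
    """Map an end value of IQ(u) (u near 0 or near 1) to a tail label."""
    a = abs(iq_end)
    if a <= 0.5:
        return "short"
    if a < 1.0:
        return "medium-short"
    return "medium-medium to long"

def informative_quantile(Q, u):
    """IQ(u) = (Q(u) - Q(0.5)) / (2 * (Q(0.75) - Q(0.25)))."""
    return (Q(u) - Q(0.5)) / (2.0 * (Q(0.75) - Q(0.25)))

uniform_Q = lambda u: u                       # uniform law on (0, 1)
exponential_Q = lambda u: -math.log1p(-u)     # standard exponential law

print(classify_tail(informative_quantile(uniform_Q, 0.01)))      # left end of the uniform
print(classify_tail(informative_quantile(exponential_Q, 0.99)))  # right end of the exponential
```

For the uniform law IQ(0.01) = -0.49, which falls in the short class; for the exponential law IQ(0.99) is about 1.78, which falls in the medium-medium to long class, in agreement with its right tail exponent of 1.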

An important family of distributions is the Weibull with shape
parameter $\beta$. Its quantile function Q(u) is of the form

$$Q(u) = \mu + \sigma Q_0(u),$$

where

$$Q_0(u) = \frac{1}{\beta} \{\log (1-u)^{-1}\}^{\beta}.$$

Its density-quantile function is

$$f_0Q_0(u) = (1-u) \{\log (1-u)^{-1}\}^{1-\beta}.$$

Its right tail exponent is $\alpha_1 = 1$, and its left tail exponent is
$\alpha_0 = 1 - \beta$. Insight into the interpretation of informative quantile
functions is obtained by computing them for Weibull distributions.
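
The Weibull tail exponents can be checked numerically. The sketch below assumes the normalization $Q_0(u) = \beta^{-1}\{\log(1-u)^{-1}\}^{\beta}$, so that $f_0Q_0(u) = (1-u)\{\log(1-u)^{-1}\}^{1-\beta}$; the function names, the finite-difference step, and the two small values of u used to estimate the slope are illustrative assumptions.

```python
import math

def Q0(u, beta):
    """Assumed normalization: Q0(u) = (1/beta) * (-log(1-u))**beta."""
    return (-math.log1p(-u)) ** beta / beta

def fQ0(u, beta):
    """Density-quantile function (1-u) * (-log(1-u))**(1-beta)."""
    return (1.0 - u) * (-math.log1p(-u)) ** (1.0 - beta)

def q0_numeric(u, beta, eps=1e-6):
    """Quantile-density Q0'(u), approximated by a central difference."""
    return (Q0(u + eps, beta) - Q0(u - eps, beta)) / (2.0 * eps)

def left_tail_slope(beta, u1=1e-6, u2=1e-8):
    """Slope of log fQ0 against log u near 0; should approach 1 - beta."""
    num = math.log(fQ0(u1, beta)) - math.log(fQ0(u2, beta))
    return num / (math.log(u1) - math.log(u2))

for beta in (0.5, 1.0, 2.0):
    # First column: fQ0 * q0 should be close to 1; second: close to 1 - beta.
    print(beta, fQ0(0.3, beta) * q0_numeric(0.3, beta), left_tail_slope(beta))
```

Because IQ(u) is invariant to location and scale, the constant factor in the normalization of $Q_0$ does not affect the informative quantile function.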

Given data, we distinguish three types of estimators of population
parameters, which we call: (1) fully non-parametric, (2) fully
parametric, and (3) functional-parametric. Fully non-parametric
estimators assume no model, and provide quick estimators. Fully
parametric estimators assume a model known up to a finite number of
parameters which must be estimated. Functional-parametric estimators
are based on methods of functional statistical inference.

A fully non-parametric estimator $\tilde Q(u)$ of Q(u), given a sample of
n distinct values $X_{1;n} < X_{2;n} < \cdots < X_{n;n}$, is defined by (for j = 1, ..., n)

$$\tilde Q(u) = X_{j;n}, \qquad \frac{j-1}{n} < u \le \frac{j}{n}.$$
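
A minimal sketch of this step-function estimator follows, assuming the convention that $\tilde Q(u)$ equals the j-th order statistic for $(j-1)/n < u \le j/n$. The function name and the small floating-point tolerance are assumptions of the example.

```python
import math

def sample_quantile(x_sorted, u):
    """Return the j-th order statistic, where (j-1)/n < u <= j/n."""
    n = len(x_sorted)
    if not 0.0 < u <= 1.0:
        raise ValueError("u must lie in (0, 1]")
    # Smallest j with u <= j/n; the small tolerance guards against
    # floating-point products such as 10 * 0.3 = 3.0000000000000004.
    j = max(1, math.ceil(n * u - 1e-12))
    return x_sorted[j - 1]

data = sorted([2.1, 0.4, 3.7, 1.5])
print([sample_quantile(data, u) for u in (0.25, 0.5, 0.75, 1.0)])
```

With four observations the values u = 0.25, 0.5, 0.75, 1.0 return the four order statistics in turn, and any u just above 0.25 already returns the second one.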


For a large sample, or for grouped values, we form a histogram before
computing $\tilde Q(u)$ by linear interpolation at an equi-spaced grid of values
kh, k = 1, 2, ..., [1/h], where usually h = 0.01.


A quantile data analysis of the random sample

1.  Forms sample distribution function $\tilde F(x)$, sample quantile
    function $\tilde Q(u)$, sample quantile density $\tilde q(u)$ at a grid of
    values of u in 0 < u < 1.

2.  Plots sample version of informative quantile function IQ(u)
    whose values as u tends to 0 and 1 indicate the tail
    exponents of the probability law of X.

3.  Determines standard distribution functions $F_0(x)$ to test

    $$H_0: F(x) = F_0\left(\frac{x - \mu}{\sigma}\right) \quad \text{or} \quad Q(u) = \mu + \sigma Q_0(u)$$

    for location and scale parameters $\mu$ and $\sigma$ to be estimated. A
    test of $H_0$ which does not require estimation of $\mu$ and $\sigma$ can be
    based on [Parzen (1979)]

    $$\tilde d(u) = f_0Q_0(u) \, \tilde q(u) \div \tilde\sigma_0 .$$

$$H(x) = \lambda F(x) + (1 - \lambda) G(x), \qquad \lambda = \frac{m}{N}.$$

To test the hypothesis of equality of distributions, $H_0: F(x) = G(x) = H(x)$,
it is customary in non-parametric statistics to introduce

$$D_X(u) = F H^{-1}(u), \qquad D_Y(u) = G H^{-1}(u),$$

with densities [equivalent to likelihood ratios]

$$d_X(u) = \frac{f H^{-1}(u)}{h H^{-1}(u)}, \qquad d_Y(u) = \frac{g H^{-1}(u)}{h H^{-1}(u)}.$$


Note that $h H^{-1}(u) = \lambda f H^{-1}(u) + (1 - \lambda) g H^{-1}(u)$; therefore

$$d_X(u) = \left\{ \lambda + (1 - \lambda) \frac{g H^{-1}(u)}{f H^{-1}(u)} \right\}^{-1}.$$
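
The identity for $d_X(u)$ can be verified numerically. The sketch below takes F and G to be exponential laws with rates 1 and 2 and a mixing proportion of 0.5; these choices, the bisection bounds, and all function names are assumptions made only for illustration.

```python
import math

lam = 0.5  # mixing proportion lambda (value assumed for illustration)

F = lambda x: 1.0 - math.exp(-x)          # F: exponential, rate 1
G = lambda x: 1.0 - math.exp(-2.0 * x)    # G: exponential, rate 2
f = lambda x: math.exp(-x)
g = lambda x: 2.0 * math.exp(-2.0 * x)
H = lambda x: lam * F(x) + (1.0 - lam) * G(x)
h = lambda x: lam * f(x) + (1.0 - lam) * g(x)

def H_inv(u, lo=0.0, hi=60.0):
    """Invert the increasing function H by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if H(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_X(u):
    """Density of D_X: f(H^{-1}(u)) / h(H^{-1}(u))."""
    x = H_inv(u)
    return f(x) / h(x)

def d_Y(u):
    """Density of D_Y: g(H^{-1}(u)) / h(H^{-1}(u))."""
    x = H_inv(u)
    return g(x) / h(x)

def d_X_alt(u):
    """Equivalent form {lambda + (1 - lambda) g/f}^{-1}."""
    x = H_inv(u)
    return 1.0 / (lam + (1.0 - lam) * g(x) / f(x))

for u in (0.1, 0.5, 0.9):
    print(u, d_X(u), d_X_alt(u), lam * d_X(u) + (1.0 - lam) * d_Y(u))
```

The last column should equal 1 for every u, since the mixture relation for h implies $\lambda d_X(u) + (1-\lambda) d_Y(u) = 1$.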


Parzen (1983) shows that all conventional two-sample nonparametric
test procedures are functionals of the following raw estimator of
$D_X(u)$:

$$\tilde D_X(u) = \{\tilde H \tilde F^{-1}\}^{-1}(u).$$


From this estimator one can form "pseudo-correlations" $\tilde\rho(v)$ and linear rank
statistics $\tilde a(J)$ with score function J(u),

$$\tilde\rho(v) = \int_0^1 e^{2\pi i u v} \, d\tilde D_X(u), \qquad \tilde a(J) = \int_0^1 J(u) \, d\tilde D_X(u),$$

and autoregressive estimators $\hat d_{X,m}(u)$ of $d_X(u)$.
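
Since the raw estimator of $D_X$ is a step function that rises by 1/m at each pooled rank of an X observation, both integrals reduce to averages over those ranks. The sketch below assumes that the jump for an observation of pooled rank R is placed at u = R/N (other conventions, such as R/(N+1), are also in use), that there are no ties, and that a centered score J(u) = u - 0.5 is of interest; the function names are hypothetical.

```python
import cmath

def pooled_ranks(x, y):
    """Ranks (1..N) of the x-values in the pooled sample; ties are assumed absent."""
    pooled = sorted(x + y)
    return [pooled.index(v) + 1 for v in x]

def pseudo_correlation(x, y, v):
    """Average of exp(2*pi*i*v*R/N) over the pooled ranks R of the x-sample."""
    N = len(x) + len(y)
    ranks = pooled_ranks(x, y)
    return sum(cmath.exp(2j * cmath.pi * v * r / N) for r in ranks) / len(x)

def linear_rank_statistic(x, y, J):
    """Average of the scores J(R/N) over the pooled ranks R of the x-sample."""
    N = len(x) + len(y)
    ranks = pooled_ranks(x, y)
    return sum(J(r / N) for r in ranks) / len(x)

x = [1.0, 3.0]
y = [2.0, 4.0]
centered_score = lambda u: u - 0.5
print(pseudo_correlation(x, y, 1), linear_rank_statistic(x, y, centered_score))
```

In this small example the X observations have pooled ranks 1 and 3 out of 4, so the centered-score statistic is exactly 0 and the pseudo-correlation at v = 1 vanishes up to rounding.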

When one observes several variables $X^{(1)}, X^{(2)}, \ldots$, one
estimates functionals of $D_j(u) = F_{X^{(j)}}(H^{-1}(u))$ or
$D_{j,k}(u) = F_{X^{(j)}}(F_{X^{(k)}}^{-1}(u))$.

III.  One Sample: Bivariate

Let $(X_1, X_2)$ be jointly continuous random variables with
distribution function $F_{X_1,X_2}(x_1, x_2) = \Pr[X_1 \le x_1, X_2 \le x_2]$ and density
$f_{X_1,X_2}(x_1, x_2)$. The joint density-quantile function is defined by

$$fQ_{X_1,X_2}(u_1, u_2) = f_{X_1,X_2}(Q_{X_1}(u_1), Q_{X_2}(u_2)).$$

To estimate $fQ_{X_1,X_2}$ we define

$$D_{X_1,X_2}(u_1, u_2) = F_{X_1,X_2}(Q_{X_1}(u_1), Q_{X_2}(u_2)),$$

which is the distribution function of $U_1 = F_{X_1}(X_1)$, $U_2 = F_{X_2}(X_2)$; it
has density

$$d_{X_1,X_2}(u_1, u_2) = \frac{\partial^2}{\partial u_1 \, \partial u_2} D_{X_1,X_2}(u_1, u_2)$$

satisfying

$$fQ_{X_1,X_2}(u_1, u_2) = fQ_{X_1}(u_1) \, fQ_{X_2}(u_2) \, d_{X_1,X_2}(u_1, u_2).$$

To estimate $d_{X_1,X_2}$ from a random sample $(X_1(j), X_2(j))$, j = 1, ..., n, we form

$$\tilde D_{X_1,X_2}(u_1, u_2) = \tilde F_{X_1,X_2}(\tilde Q_{X_1}(u_1), \tilde Q_{X_2}(u_2))$$

and a raw estimator $\tilde d_{X_1,X_2}(u_1, u_2)$. We smooth $\log \tilde d_{X_1,X_2}(u_1, u_2)$ to a
smooth estimator $\log \hat d_{X_1,X_2}(u_1, u_2)$ minimizing a criterion similar to

$$\sum_j \left| \log \tilde d[U_1(j), U_2(j)] - \log \hat d[U_1(j), U_2(j)] \right|^2,$$

where $\log \hat d$ has the parametric representation (exponential model)

$$\log \hat d_m(u_1, u_2) = \sum_{v_1, v_2} \theta_{v_1,v_2} \exp\{2\pi i (u_1 v_1 + u_2 v_2)\} - \phi(\theta_{v_1,v_2}),$$

where the summation is over $v_1, v_2 = 0, \pm 1, \ldots, \pm m$, and $\phi(\theta_{v_1,v_2})$ is an
integrating factor to make $\hat d_m$ a probability density. The
foregoing estimators have been implemented in T. J. Woodfield [1982].

The problem of choosing a best value of the order m is approached by
evaluating the entropy of $\hat d_m$.
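
One simple way to picture a raw estimator of the dependence density is a cell-count histogram of the rank-transformed sample on the unit square. The sketch below is an assumption for illustration only and is not the estimator implemented by Woodfield: it maps each observation to (rank - 0.5)/n, counts points in a k by k grid, and divides by n times the cell area so that the result integrates to 1. Ties are assumed absent.

```python
def rank_transform(values):
    """Map each value to (rank - 0.5) / n; ties are assumed absent."""
    order = sorted(values)
    n = len(values)
    return [(order.index(v) + 0.5) / n for v in values]

def raw_dependence_density(x1, x2, k):
    """Histogram estimate of the density of (U1, U2) on a k-by-k grid."""
    n = len(x1)
    u1, u2 = rank_transform(x1), rank_transform(x2)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(u1, u2):
        # min(...) keeps a value of exactly 1.0 inside the last cell.
        counts[min(k - 1, int(a * k))][min(k - 1, int(b * k))] += 1
    cell_area = 1.0 / (k * k)
    return [[c / (n * cell_area) for c in row] for row in counts]

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [10.0, 20.0, 30.0, 40.0]   # perfectly increasing in x1
print(raw_dependence_density(x1, x2, 2))
```

For perfectly increasing data all of the mass falls in the diagonal cells, whereas independent variables would give values near 1 in every cell for large n.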

D.  Ph.D. Theses

Four  Ph.D.  theses  under  Professor  Parzen's  direction  have  been 
completed  with  support  from  the  Army  Research  Office  in  the  years  of 
1979-1982.  The  theses  of  R.  L.  Eubank,  J.  M.  White,  T.  J.  Prihoda,  and 
T.  J.  Woodfield  focused  respectively  on  the  quantile  and  density-quantile 
approach  to  estimation  of  location  and  scale  parameters;  comparison  of  k 
samples;  estimation  of  location  and  scale  differences  of  two  samples;  and 
estimation  of  bivariate  joint  density-quantile  functions.  The  work  of 
S.  Anderson  was  unfortunately  terminated  in  1982  by  his  accidental  death. 